Skip to content

Conversation

eggfoobar
Copy link
Contributor

added a handler func during event watch to trigger a job on both nodes to restart the podman-etcd service

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 7, 2025
Copy link

coderabbitai bot commented Oct 7, 2025

Walkthrough

Adds a restart-etcd command to the setup runner; introduces an operator Secret event handler to detect etcd cert changes and trigger restart jobs; exposes a PCS API to restart etcd; extends job tooling with a restart-etcd type and timeout; and adds a runner that invokes the PCS restart.

Changes

Cohort / File(s) Summary
CLI command wiring
cmd/tnf-setup-runner/main.go
Imports restart-etcd package and registers a new Cobra command via NewRestartEtcdCommand that runs tnfrestartetcd.RunEtcdRestart, exiting fatally on error.
Operator: Etcd cert change handler
pkg/tnf/operator/starter.go
Adds Secret informer handler handleEtcdCertChange to detect changes in tlshelpers.EtcdAllCertsSecretName; serializes/debounces restarts via mutex/flag; lists control-plane nodes, starts JobTypeRestartEtcd jobs, waits for completion, then deletes preceding fencing jobs; integrates into Secrets Add/Update/Delete wiring; adds tlshelpers import and new globals (etcdCertUpdateTriggered, etcdCertUpdateMutex).
PCS: Etcd restart API
pkg/tnf/pkg/pcs/etcd.go
Adds exported RestartEtcd(ctx context.Context) error to check PCS resource status and restart the etcd resource if present, returning errors encountered.
Tools: Jobs types/timeouts
pkg/tnf/pkg/tools/jobs.go
Adds RestartEtcdJobCompletedTimeout constant and JobTypeRestartEtcd enum value; updates JobType.GetSubCommand() to return "restart-etcd" for this job type.
Runner: restart-etcd
pkg/tnf/restart-etcd/runner.go
Adds RunEtcdRestart() which creates a background context, logs start/end, calls pcs.RestartEtcd(ctx), and returns any error.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title accurately reflects the main functionality—adding logic for TNF certificate rotation—but the “WIP:” and “feat:” prefixes introduce unnecessary noise and do not contribute to clarity.
Description Check ✅ Passed The description succinctly explains that a handler function was added to trigger a podman-etcd restart job on both nodes, which directly corresponds to the secret-watch and restart logic implemented in the pull request.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e4b1946 and 00d5025.

📒 Files selected for processing (5)
  • cmd/tnf-setup-runner/main.go (3 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/pkg/pcs/etcd.go (1 hunks)
  • pkg/tnf/pkg/tools/jobs.go (3 hunks)
  • pkg/tnf/restart-etcd/runner.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/tnf/pkg/pcs/etcd.go

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from jaypoulz and slintes October 7, 2025 21:35
Copy link
Contributor

openshift-ci bot commented Oct 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eggfoobar
Once this PR has been reviewed and has the lgtm label, please assign mshitrit for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e06c84b and e4b1946.

📒 Files selected for processing (5)
  • cmd/tnf-setup-runner/main.go (3 hunks)
  • pkg/tnf/operator/starter.go (4 hunks)
  • pkg/tnf/pkg/pcs/etcd.go (1 hunks)
  • pkg/tnf/pkg/tools/jobs.go (3 hunks)
  • pkg/tnf/restart-etcd/runner.go (1 hunks)

for _, node := range nodeList {
runJobController(ctx, tools.JobTypeRestartEtcd, &node.Name, controllerContext, operatorClient, client, kubeInformersForNamespaces)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid spawning restart job controllers on every secret event

runJobController spins up a long-lived controller (go ...Run(ctx, 1)). Calling it inside the cert-change handler means every secret update (and even re-list events) will start another controller instance per node, leading to unbounded goroutines and duplicated informers fighting over the same tnf-restart-etcd-job-* resources. Please start these controllers once (e.g. alongside the other TNF job controllers) and let the handler only manage job lifecycle triggers.

@eggfoobar
Copy link
Contributor Author

/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-etcd-certrotation

Copy link
Contributor

openshift-ci bot commented Oct 7, 2025

@eggfoobar: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ovn-two-node-fencing-etcd-certrotation

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7ace3120-a3ca-11f0-859f-c3a8764c54a7-0

added a handler func during event watch to trigger a job on both nodes to restart the podman-etcd service

Signed-off-by: ehila <[email protected]>
@eggfoobar eggfoobar force-pushed the add-tnf-cert-rotation-handling branch from e4b1946 to 00d5025 Compare October 8, 2025 06:10
Copy link
Contributor

openshift-ci bot commented Oct 8, 2025

@eggfoobar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-etcd-scaling e4b1946 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-aws-disruptive-ovn e4b1946 link false /test e2e-aws-disruptive-ovn
ci/prow/verify 00d5025 link true /test verify
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown e4b1946 link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-azure-ovn-etcd-scaling e4b1946 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-aws-disruptive e4b1946 link false /test e2e-aws-disruptive
ci/prow/e2e-metal-ovn-two-node-fencing e4b1946 link false /test e2e-metal-ovn-two-node-fencing
ci/prow/e2e-gcp-disruptive e4b1946 link false /test e2e-gcp-disruptive
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown e4b1946 link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-ovn-etcd-scaling e4b1946 link false /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-gcp-disruptive-ovn e4b1946 link false /test e2e-gcp-disruptive-ovn
ci/prow/e2e-aws-etcd-certrotation e4b1946 link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-metal-assisted e4b1946 link false /test e2e-metal-assisted
ci/prow/e2e-aws-cpms e4b1946 link false /test e2e-aws-cpms
ci/prow/okd-scos-e2e-aws-ovn 00d5025 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-etcd-recovery e4b1946 link false /test e2e-aws-etcd-recovery
ci/prow/e2e-gcp-ovn-etcd-scaling e4b1946 link false /test e2e-gcp-ovn-etcd-scaling

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this can work, because AFAIK there is a delay between the cert Secret being updated, and the actual files on the nodes being synced. I'm afraid the podman-etcd restart happens too early.
Also, I tried to create a cluster with this patch. Looks like it runs the new restart jobs too early, when the pacemaker cluster wasn't setup yet, which fails cluster creation:

$ k get job,pod
NAME                                      STATUS     COMPLETIONS   DURATION   AGE
job.batch/tnf-after-setup-job-master-0    Complete   1/1           4m35s      20h
job.batch/tnf-after-setup-job-master-1    Complete   1/1           4m34s      20h
job.batch/tnf-auth-job-master-0           Complete   1/1           6s         20h
job.batch/tnf-auth-job-master-1           Complete   1/1           7s         20h
job.batch/tnf-fencing-job                 Complete   1/1           4m31s      20h
job.batch/tnf-restart-etcd-job-master-0   Failed     0/1           20h        20h
job.batch/tnf-restart-etcd-job-master-1   Failed     0/1           20h        20h
job.batch/tnf-setup-job                   Complete   1/1           4m21s      20h

NAME                                      READY   STATUS      RESTARTS   AGE
pod/etcd-guard-master-0                   1/1     Running     0          20h
pod/etcd-guard-master-1                   1/1     Running     0          20h
pod/etcd-master-0                         4/4     Running     0          20h
pod/etcd-master-1                         4/4     Running     0          20h
pod/installer-3-master-1                  0/1     Completed   0          20h
pod/installer-5-master-0                  0/1     Completed   0          20h
pod/installer-5-master-1                  0/1     Completed   0          20h
pod/installer-6-master-0                  0/1     Completed   0          20h
pod/installer-6-master-1                  0/1     Completed   0          20h
pod/installer-7-master-0                  0/1     Completed   0          20h
pod/installer-7-master-1                  0/1     Completed   0          20h
pod/revision-pruner-7-master-0            0/1     Completed   0          20h
pod/revision-pruner-7-master-1            0/1     Completed   0          20h
pod/tnf-after-setup-job-master-0-8btwt    0/1     Completed   0          20h
pod/tnf-after-setup-job-master-1-h4k9r    0/1     Completed   0          20h
pod/tnf-auth-job-master-0-fwr26           0/1     Completed   0          20h
pod/tnf-auth-job-master-1-4z72f           0/1     Completed   0          20h
pod/tnf-fencing-job-8jzgf                 0/1     Completed   0          20h
pod/tnf-restart-etcd-job-master-0-2zzcw   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-4rpwm   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-5lqzp   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-0-lld2c   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-2c6dp   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-7vx2c   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-pssr6   0/1     Error       0          20h
pod/tnf-restart-etcd-job-master-1-tqhks   0/1     Error       0          20h
pod/tnf-setup-job-d9prb                   0/1     Completed   0          20h

$ k logs tnf-restart-etcd-job-master-0-2zzcw 
I1015 13:13:41.793135   14200 runner.go:12] Running TNF etcd restart
I1015 13:13:41.793260   14200 etcd.go:37] Checking pcs resources
I1015 13:13:41.793298   14200 exec.go:24] Executing: /usr/bin/nsenter -a -t 1 /bin/bash -c /usr/sbin/pcs resource status
I1015 13:13:43.190362   14200 exec.go:38]   stdout: 
I1015 13:13:43.190670   14200 exec.go:39]   stderr: Error: unable to get cluster status from crm_mon
crm_mon: Connection to cluster failed: Connection refused
I1015 13:13:43.190726   14200 exec.go:41]   err: exit status 1
E1015 13:13:43.190744   14200 etcd.go:41] exit status 1Failed to get pcs resource statusstdoutstderrError: unable to get cluster status from crm_mon
crm_mon: Connection to cluster failed: Connection refused
errexit status 1
F1015 13:13:43.190795   14200 main.go:126] exit status 1

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2025
@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants